In this exercise, we will be using functions from the tidyverse, performance, gtsummary, broom, and emmeans packages.

library(tidyverse)
library(performance)
library(gtsummary)
library(emmeans)
library(broom)

(a) Fit a one-way ANOVA to a wine-growing experiment

The file pinot.csv contains the results of an experiment by winemaker Vincent Lakey, comparing a standard herbicide with two greener alternatives: straw mulch and compost.

Each of the three treatments was applied in six different areas of the vineyard. The area is recorded in the variable block.

For this exercise, we will fit a one-way ANOVA to the response variable Weight2001 (the weight of grapes harvested in 2001, in kg) using the explanatory variable Treatment.

Make a residual plot using check_model(). (To make the residual plot look nicer in the knitted output, you may want to change fig.width and fig.height - look at the solution for an example.)

Use joint_tests() to test the hypothesis that all treatments resulted in equal mean harvest weights.

Use emmeans() to produce a table of means and confidence intervals for the three treatments.

Use pairs() to produce a table of pairwise comparisons, with confidence intervals and p-values. (Note that this would not normally be reported, as the ANOVA result provides no reason to believe treatment means differ.)

Extension: produce a plot of the means of the three treatments, with error bars showing 95% confidence intervals.

pinot <- read_csv("pinot.csv")
m <- lm(Weight2001 ~ Treatment, data = pinot)
check_model(m)

joint_tests(m)
 model term df1 df2 F.ratio p.value
 Treatment    2  15   1.663  0.2226
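
As a check on the table above, the p-value can be recomputed from the reported F ratio and its degrees of freedom using base R's pf(). This is only a sketch based on the rounded values printed above, so it should agree with the table to about three decimal places rather than exactly.

```r
# Values copied from the joint_tests() output above (rounded as printed)
f_ratio <- 1.663
df1 <- 2   # number of treatments minus 1
df2 <- 15  # 18 observations minus 3 treatment means

# The upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_ratio, df1, df2, lower.tail = FALSE)
round(p_value, 3)
```
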
emmeans(m, "Treatment")
 Treatment emmean    SE df lower.CL upper.CL
 compost     3.93 0.498 15     2.87     4.99
 herbicide   3.27 0.498 15     2.21     4.33
 straw       4.56 0.498 15     3.50     5.62

Confidence level used: 0.95 
emmeans(m, "Treatment") %>%
  pairs(adjust = "none", infer = TRUE)
 contrast            estimate    SE df lower.CL upper.CL t.ratio p.value
 compost - herbicide    0.657 0.704 15   -0.843    2.157   0.933  0.3655
 compost - straw       -0.627 0.704 15   -2.127    0.873  -0.891  0.3872
 herbicide - straw     -1.283 0.704 15   -2.783    0.217  -1.824  0.0882

Confidence level used: 0.95 
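
To see where the numbers in the pairwise table come from, the sketch below rebuilds the compost - herbicide row by hand from the rounded values printed above. It assumes equal replication across treatments (six plots each, as described in the exercise), so small discrepancies in the third decimal place are expected.

```r
# Values copied from the emmeans() and pairs() output above (rounded as printed)
se_mean  <- 0.498  # standard error of each treatment mean
estimate <- 0.657  # compost - herbicide
df_resid <- 15     # residual degrees of freedom

# With equal replication, the SE of a difference is sqrt(2) times the SE of a mean
se_diff <- se_mean * sqrt(2)

# 95% confidence interval: estimate +/- t quantile * SE
t_crit <- qt(0.975, df_resid)
ci <- estimate + c(-1, 1) * t_crit * se_diff

# Two-sided, unadjusted p-value from the t ratio
p_value <- 2 * pt(abs(estimate / se_diff), df_resid, lower.tail = FALSE)

round(c(se_diff = se_diff, lower = ci[1], upper = ci[2], p = p_value), 3)
```

The same recipe applies to the other two rows, because every contrast shares the same standard error and degrees of freedom.
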
emmeans(m, "Treatment") %>%
  as_tibble() %>%
  mutate(Treatment = fct_reorder(Treatment, emmean)) %>%
  ggplot(aes(y = Treatment, x = emmean, xmin = lower.CL, xmax = upper.CL)) +
  geom_errorbar(width = 0.5) +
  geom_point() +
  labs(x = "Mean harvest in 2001 (kg)") +
  scale_x_continuous(limits = c(2, 6)) +
  theme_minimal() +
  theme(panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank())

(b) Fit a linear regression to women’s Olympic 100m gold medal times against year

Read in the file olympic_100m_results.csv and use filter() to select the results for women (Gender is W) gold medalists (Medal is G).

Fit a linear regression to Result as a function of Year.

Use model_performance() to obtain model summary statistics, e.g. R-squared.

Make a residual plot using check_model().

Obtain an estimate, confidence interval and p-value for the slope using tidy() or tbl_regression(). (Is the rounding appropriate? Look at the solutions for a way to control the number of decimal places.)

Plot the original data and use geom_smooth() to add a line of best fit.

Does this model appear to be a good fit to the data? Is it plausible for this relationship to continue into the future?

olympic_100m <- read_csv("olympic_100m_results.csv")
Rows: 138 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Gender, Event, Location, Medal, Name, Nationality
dbl (3): Year, Result, Time
lgl (1): Wind

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
olympic_100m_women_gold <- filter(olympic_100m, Gender == "W", Medal == "G")
m <- lm(Result ~ Year, data = olympic_100m_women_gold)
model_performance(m)
# Indices of model performance

AIC    |   AICc |    BIC |    R2 | R2 (adj.) |  RMSE | Sigma
------------------------------------------------------------
-4.118 | -2.518 | -1.285 | 0.807 |     0.795 | 0.185 | 0.196
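
The sketch below shows how several of these summary statistics relate to the residual sum of squares, using base R only. It uses the built-in cars data purely for illustration (not the Olympic data), and the RMSE and Sigma formulas are the usual textbook definitions, which appear consistent with the values reported above.

```r
# Built-in `cars` data used purely for illustration (not the Olympic data)
fit <- lm(dist ~ speed, data = cars)

n   <- nobs(fit)           # number of observations
p   <- length(coef(fit))   # number of estimated coefficients (intercept + slope)
rss <- sum(residuals(fit)^2)                  # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares

r2        <- 1 - rss / tss                        # R-squared
r2_adj    <- 1 - (1 - r2) * (n - 1) / (n - p)     # adjusted R-squared
rmse      <- sqrt(rss / n)                        # root mean squared error
sigma_hat <- sqrt(rss / (n - p))                  # residual standard deviation

round(c(R2 = r2, R2_adj = r2_adj, RMSE = rmse, Sigma = sigma_hat), 3)
```

RMSE divides by the number of observations while Sigma divides by the residual degrees of freedom, which is why Sigma is always slightly larger.
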
check_model(m)

tbl_regression(m, estimate_fun = ~style_number(., digits = 3))
 Characteristic   Beta     95% CI¹          p-value
 Year             -0.014   -0.018, -0.011   <0.001
 ¹ CI = Confidence Interval
tidy(m, conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic       p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl>
1 (Intercept)  39.4      3.35        11.8  0.00000000136  32.3      46.5   
2 Year         -0.0143   0.00170     -8.42 0.000000181    -0.0179   -0.0107
ggplot(olympic_100m_women_gold,
       aes(x = Year, y = Result)) +
  geom_smooth(method = "lm") + 
  geom_point()
`geom_smooth()` using formula = 'y ~ x'
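
One way to probe whether the trend could plausibly continue is to extrapolate the fitted line. The sketch below uses the rounded coefficients from the tidy() output above; predict_time() is a hypothetical helper introduced for illustration, and the results are only approximate because of that rounding.

```r
# Coefficients copied from the tidy() output above (rounded as printed)
intercept <- 39.4
slope     <- -0.0143  # change in winning time, in seconds per year

# Hypothetical helper: predicted winning time (seconds) for a given year
predict_time <- function(year) intercept + slope * year

# Predicted winning times for a few future years
years <- c(2024, 2100, 2500)
data.frame(Year = years, Predicted = round(predict_time(years), 2))

# Year at which the fitted line would reach a winning time of zero seconds
zero_year <- -intercept / slope
round(zero_year)
```

A straight line must eventually predict a time of zero and then negative times (here, somewhere in the 2700s), so the linear relationship cannot hold indefinitely even if it describes the observed years well.
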

(c) Extension: fit an interaction between gender and time

Fit a linear model to both men’s and women’s 100m sprint gold medal times, with gender, year, and their interaction in the model.

Are the parameter estimates (from tbl_regression() or tidy()) easy to interpret?

Can you use emmeans() to obtain the mean for each gender?

This continuous-by-categorical interaction is best analysed using emmeans features we haven’t seen yet: emmeans(m, "Gender", at = list(Year = 2020)) obtains estimated means for each gender in the year 2020. emtrends(m, "Gender", "Year") obtains estimated slopes for each gender.

Use pairs() to obtain an estimate and confidence interval for the difference in slopes between genders.

olympic_100m_gold <- filter(olympic_100m, Medal == "G")
m <- lm(Result ~ Gender*Year, data = olympic_100m_gold)
check_model(m)
Variable `Component` is not in your data frame :/

tidy(m, conf.int = TRUE)
# A tibble: 4 × 7
  term         estimate std.error statistic  p.value conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)  35.0       2.26       15.5   6.09e-19 30.4      39.5    
2 GenderW       4.42      4.40        1.00  3.21e- 1 -4.47     13.3    
3 Year         -0.0126    0.00116   -10.9   8.05e-14 -0.0149   -0.0103 
4 GenderW:Year -0.00169   0.00224    -0.757 4.53e- 1 -0.00621   0.00282
tbl_regression(m, estimate_fun = ~style_number(., digits = 3))
 Characteristic   Beta     95% CI¹          p-value
 Gender
     M
     W            4.421    -4.467, 13.309   0.3
 Year             -0.013   -0.015, -0.010   <0.001
 Gender * Year
     W * Year     -0.002   -0.006, 0.003    0.5
 ¹ CI = Confidence Interval
emmeans(m, "Gender", at = list(Year = 2020))
NOTE: Results may be misleading due to involvement in interactions
 Gender emmean    SE df lower.CL upper.CL
 M        9.54 0.084 42     9.37     9.71
 W       10.54 0.104 42    10.33    10.75

Confidence level used: 0.95 
emtrends(m, "Gender", "Year")
 Gender Year.trend      SE df lower.CL upper.CL
 M         -0.0126 0.00116 42  -0.0149  -0.0103
 W         -0.0143 0.00192 42  -0.0182  -0.0104

Confidence level used: 0.95 
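
The emmeans() and emtrends() results can be connected back to the raw coefficients. The sketch below does this arithmetic with the rounded values from the tidy() output above. Because the intercept is printed to only three significant figures, the reconstructed 2020 means agree with the emmeans() table to about one decimal place.

```r
# Coefficients copied from the tidy() output above (rounded as printed)
b_intercept <- 35.0      # intercept for men (the reference level)
b_gender_w  <- 4.42      # shift in intercept for women
b_year      <- -0.0126   # slope for men
b_interact  <- -0.00169  # additional slope for women

# Slopes by gender, as reported by emtrends()
slope_m <- b_year
slope_w <- b_year + b_interact

# Estimated means in 2020, as reported by emmeans(..., at = list(Year = 2020))
mean_m_2020 <- b_intercept + slope_m * 2020
mean_w_2020 <- (b_intercept + b_gender_w) + slope_w * 2020

round(c(slope_m = slope_m, slope_w = slope_w), 4)
round(c(M = mean_m_2020, W = mean_w_2020), 2)
```

This is why the raw coefficients are hard to read directly: the GenderW term is the difference between genders at year 0, far outside the data, whereas the interaction term is simply the difference in slopes.
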
emtrends(m, "Gender", "Year") %>%
  pairs(adjust = "none") %>%
  summary(infer = TRUE)
 contrast estimate      SE df lower.CL upper.CL t.ratio p.value
 M - W     0.00169 0.00224 42 -0.00282  0.00621   0.757  0.4534

Confidence level used: 0.95 

© 2021 Statistical Consulting Centre, The University of Melbourne.